import pandas as pd
import numpy as np
import plotly.express as px
Chapter 7 - Moving Beyond Linearity¶
Relaxing Assumption Of Linearity¶
- Recall previous notation, we have $p$ predictors $X=\left(X_{1}, X_{2}, \dots, X_{p}\right)\in\mathbb{R}^{n\times p}$ and a response $y\in\mathbb{R}^{n}$
- We assume that there is some relationship between $X$ and $y$ that we can represent as: $$y = f\left(X\right) + \epsilon$$
- Additionally, recall that we do not know the true $f$ and therefore are trying to estimate it
- Our estimate is denoted by $\hat{f}$. The value for the response produced by this estimated function is $\hat{y}$
- So far we have restricted the form of $\hat{f}$ to be linear:
$$\hat{y} = X\beta $$
- We now relax this assumption and build on the linear models to maintain interpretability
Different methods¶
The following methods build on the linear regression methods
- Polynomial regression - adding extra predictors of the form $X^{d}$ (tend to use $d < 4$ otherwise unusual/unlikely shapes)
- Step functions - fitting a piecewise constant function, for $K$ distinct regions. Effectively creating a qualitative variable.
- Regression Splines - extension of polynomial regression and step functions. Split into $K$ regions and fit a polynomial in each, constrained to join smoothly at the boundaries
- Smoothing Splines - similar to regression spline, but motivated differently. Arise from minimizing residual sum of squares subject to smoothness penalty
- Local regression - similar to splines but regions are allowed to overlap.
- Generalized Additive Models - generalizes the above methods to deal with multiple predictors
Polynomial Regression¶
$$ \hat{f}\left(x_0\right) = \sum_{j=0}^D \beta_jx_0^j $$
- Rule of thumb - $D < 4$ to avoid overfitting and strange shapes especially at the boundaries of $x$
- We will only be working with one feature and taking the polynomial expansion of that
- We can fit this using normal linear regression with features $X_1 = X, X_2 = X^2, \dots, X_D = X^D$
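The last bullet can be sketched directly with least squares on the expanded features (synthetic data and degree 3 are illustrative assumptions):

```python
import numpy as np

# Fit a degree-3 polynomial by ordinary least squares on the
# expanded features (1, x, x^2, x^3). Data here is synthetic.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 200)
y = 1 + 2 * x - 3 * x**3 + rng.normal(0, 0.1, size=x.shape)

D = 3
X = np.vander(x, D + 1, increasing=True)  # columns: x^0, x^1, ..., x^D
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta
```

This is exactly a linear regression; only the design matrix has changed.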
Example Standard Error (Logistic Regression)¶

Equations¶
Suppose $y$ is wage and $x$ is age
$$ y_i = \beta_0 + \beta_1 x_i + \beta_2x_i^2 + \beta_3x_i^3 + \beta_4 x_i^4 + \epsilon_i $$
$$ \mathbb{P}\left[y_i > 250 \middle| x_i\right] = \frac{\exp\left(\beta_0 + \beta_1 x_i + \beta_2x_i^2 + \beta_3x_i^3 + \beta_4 x_i^4\right)}{1 + \exp\left(\beta_0 + \beta_1 x_i + \beta_2x_i^2 + \beta_3x_i^3 + \beta_4 x_i^4\right)} $$
Side Track¶
- We can estimate the standard error of $\hat{f}\left(x_0\right)$ as the square root of the variance of $\hat{f}\left(x_0\right)$
- Where $\mathbb{V}\left[\hat{f}\left(x_0\right)\right] = l_0'\mathbb{V}\left[\hat{\beta}\right]l_0$ with $l_0' = \left(1,x_0,x_0^2,\dots,x_0^d\right)$
- The standard error explodes near the boundary of the data because $x_0^d$ grows rapidly as $x_0$ increases
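A numeric sketch of the variance formula above (the uniform design, sample size and noise level $\sigma$ are illustrative assumptions):

```python
import numpy as np

# Var[f_hat(x0)] = l0' Cov[beta_hat] l0 with l0' = (1, x0, ..., x0^d)
rng = np.random.default_rng(1)
n, d, sigma = 100, 4, 1.0
x = rng.uniform(-1, 1, n)
X = np.vander(x, d + 1, increasing=True)
cov_beta = sigma**2 * np.linalg.inv(X.T @ X)  # OLS coefficient covariance

def se_fhat(x0):
    l0 = np.power(x0, np.arange(d + 1))
    return float(np.sqrt(l0 @ cov_beta @ l0))
```

Evaluating `se_fhat` on a grid shows the pointwise standard error growing toward the edges of the observed $x$ range, matching the bullet above.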
Step Functions¶
- We partition our feature space and fit a constant for each partition
- Doing this will result in the fitted function to be a step function
- Let us start with the following single variable dataset:

| $y$ | $X$ |
|---|---|
| $y_1$ | $x_1$ |
| $y_2$ | $x_2$ |
| $y_3$ | $x_3$ |
| $y_4$ | $x_4$ |
| $y_5$ | $x_5$ |
| $y_6$ | $x_6$ |
- Suppose we create the partition with ($x_1, x_2$), ($x_3, x_4$) and ($x_5, x_6$), ignoring the constant term we have:
| $y$ | $\tilde{X}_1$ | $\tilde{X}_2$ | $\tilde{X}_3$ |
|---|---|---|---|
| $y_1$ | 1 | 0 | 0 |
| $y_2$ | 1 | 0 | 0 |
| $y_3$ | 0 | 1 | 0 |
| $y_4$ | 0 | 1 | 0 |
| $y_5$ | 0 | 0 | 1 |
| $y_6$ | 0 | 0 | 1 |
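A minimal pandas sketch of building this design matrix (the cut points and $x$ values are illustrative assumptions):

```python
import pandas as pd

# Bin the feature into three regions, then one-hot encode the bins
x = pd.Series([1.0, 2.0, 4.0, 5.0, 7.0, 8.0])
bins = pd.cut(x, bins=[0, 3, 6, 9])        # regions (0,3], (3,6], (6,9]
X_tilde = pd.get_dummies(bins).astype(int)  # indicator columns, one per region
```

Regressing $y$ on `X_tilde` (without an intercept) fits one constant per region.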
Example Standard Error¶

Equations¶
Suppose $y$ is wage and $x$ is age
$$ y_i = \beta_0 + \beta_1 \tilde{x}_{i1} + \beta_2 \tilde{x}_{i2} + \beta_3 \tilde{x}_{i3} + \epsilon_i $$
$$ \mathbb{P}\left[y_i > 250 \middle| x_i\right] = \frac{\exp\left(\beta_0 + \beta_1 \tilde{x}_{i1} + \beta_2 \tilde{x}_{i2} + \beta_3 \tilde{x}_{i3}\right)}{1 + \exp\left(\beta_0 + \beta_1 \tilde{x}_{i1} + \beta_2 \tilde{x}_{i2} + \beta_3 \tilde{x}_{i3}\right)} $$
Note¶
- If there are no natural break points, you can miss key changes in the data
- Choosing the break points can be problematic
- We call the break points cut points or knots
Piecewise Polynomials¶
- This combines the two previous concepts: we use a polynomial expansion of the feature but fit a separate model in each partition
- Example:
$$ y_i = \begin{cases} \beta_{01} + \beta_{11}x_i + \beta_{21}x_i^2 + \beta_{31}x_i^3 + \epsilon_i & \text{if } x_i < c\\ \beta_{02} + \beta_{12}x_i + \beta_{22}x_i^2 + \beta_{32}x_i^3 + \epsilon_i & \text{if } x_i \geq c\\ \end{cases} $$
Basis Functions¶
- Polynomial and step functions are special cases of the basis function approach; all of these benefit from using the tools of linear regression
- Suppose we have functions $b_1\left(X\right), \dots, b_k\left(X\right)$, fixed and known ahead of time, which form the new set of features
- Basis functions should be the smallest set of functions which spans the space of $f$; generally this would be an infinite-dimensional space, so we take a truncated basis
- Examples are wavelets and Fourier series, as well as regression splines (next section)
x = np.linspace(0, 1, 1_000)
D = 10
(
    pd.DataFrame({
        "x": x,
        **{f"x^{d}": np.power(x, d) for d in range(D)},
    })
    .pipe(px.line, x="x", y=[f"x^{d}" for d in range(D)], height=600, title=f"Monomial Basis Functions Up To Degree {D - 1}")
)
x = np.linspace(0, 2, 1_000)
D = 10
(
    pd.DataFrame({
        "x": x,
        **{f"x^{d}": np.power(x, d) for d in range(D)},
    })
    .pipe(px.line, x="x", y=[f"x^{d}" for d in range(D)], height=600, title=f"Monomial Basis Functions Up To Degree {D - 1}")
)
x = np.linspace(0, 1, 1_000)
partitions = 5
(
    pd.DataFrame({
        "x": x,
        **{f"p{i}": np.where((i / partitions <= x) & (x <= (i + 1) / partitions), 1, 0) for i in range(partitions)},
    })
    .pipe(px.line, x="x", y=[f"p{i}" for i in range(partitions)], height=600, title=f"Step Functions Partitioning Space into {partitions}", line_shape="hv")
)
x = np.linspace(0, 1, 1_000)
harmonics = 5
(
    pd.DataFrame({
        "x": x,
        **{f"cos{i}": np.cos(np.pi * (i + 1) * x) for i in range(harmonics)},
        **{f"sin{i}": np.sin(np.pi * (i + 1) * x) for i in range(harmonics)},
    })
    .pipe(px.line, x="x", y=[f"cos{i}" for i in range(harmonics)] + [f"sin{i}" for i in range(harmonics)], height=600, title=f"Fourier Basis Functions Up To {harmonics} Harmonics")
)
x = np.linspace(0, 1, 1_000)
num_peaks = 5
(
    pd.DataFrame({
        "x": x,
        # Source - https://courses.cs.washington.edu/courses/cse446/20wi/Lecture7/07_BasisFunctions.pdf
        # Gaussian bumps exp(-(x - mu)^2 / (2 sigma^2)) with sigma = 0.05
        **{f"gaussian{i}": np.exp(-np.power(x - i / num_peaks, 2) / (2 * 0.05**2)) for i in range(num_peaks + 1)},
    })
    .pipe(px.line, x="x", y=[f"gaussian{i}" for i in range(num_peaks + 1)], height=600, title="Gaussian Basis Functions")
)
Piecewise Versus Spline¶

Regression Splines¶
- A degree $D$ spline is a piecewise polynomial of degree $D$ with continuous derivatives up to degree $D-1$
- We can represent the cubic spline using truncated power basis functions defined as:
$$\begin{align} f\left(x\right) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \sum_{k=1}^K\beta_{k+3} \left(x-\xi_k\right)^3_+ \end{align}$$
- Where $\xi_k$ denotes the knots for $k=1,\dots, K$ and
$$ \left(x-\xi_k\right)^3_+ = \begin{cases} \left(x-\xi_k\right)^3 & \text{if } x>\xi_k\\ 0 & \text{otherwise}\\ \end{cases} $$
- Exercise 1 in the conceptual exercises shows that, with one knot, the truncated power basis functions represent a cubic spline
- With the truncated power basis representation, it is easy to see how to fit this with linear regression
- This generalizes to order $D$ splines
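The truncated power basis above can be built and fit with plain least squares (knots and data are illustrative assumptions):

```python
import numpy as np

# Cubic spline via the truncated power basis, fit by least squares
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 1, 300))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, size=x.shape)

knots = np.array([0.25, 0.5, 0.75])
X = np.column_stack(
    [np.ones_like(x), x, x**2, x**3]
    + [np.clip(x - k, 0, None) ** 3 for k in knots]  # (x - xi_k)^3_+
)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta
```

Note the design matrix has $K + 4 = 7$ columns, matching the degrees-of-freedom count in the next section.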
Degrees of Freedom¶
- For a general linear regression, the number of degrees of freedom is $p$, where $p$ is the number of features (including the intercept)
- For a cubic spline, we have the features $x^0, x^1, x^2, x^3$ and $\left(x-\xi_k\right)^3_+$ for $k=1,\dots,K$, hence the number of degrees of freedom is $K+4$
- For piecewise polynomial regression with $K$ knots, we fit $K+1$ different degree-$D$ polynomial models, giving us $\left(D + 1\right)\times\left(K+1\right)$ features and therefore degrees of freedom. Too flexible!
- Each constraint we add removes one degree of freedom; for the cubic case, we add three constraints per knot - continuity, first derivative continuity and second derivative continuity
- For one knot (top-left chart), we start with $8 = 4 \times 2$ degrees of freedom for the piecewise cubic; adding the constraints for the spline leaves $8 - 3 = 5$ degrees of freedom (bottom left)
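The counting above is a short arithmetic check:

```python
# Degrees-of-freedom arithmetic from the bullets above
D, K = 3, 1                       # cubic, one knot
piecewise_df = (D + 1) * (K + 1)  # unconstrained piecewise cubic
spline_df = piecewise_df - 3 * K  # 3 continuity constraints per knot
assert piecewise_df == 8
assert spline_df == 5             # matches K + 4
```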
Natural Cubic Spline¶
- Extrapolate linearly beyond the boundary knots, adding 4 extra constraints (2 per edge partition: no quadratic or cubic terms, so the fit is linear there)

Choosing Knots¶
- Place more knots where the function varies more rapidly (perhaps using a rolling standard deviation to define this)
- Place knots uniformly, e.g. at quantiles of the data
- Use cross validation to choose the number of knots
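Placing knots at quantiles can be sketched with `np.quantile` (the skewed data and $K=4$ are illustrative assumptions):

```python
import numpy as np

# Place K interior knots at evenly spaced quantiles of the data
rng = np.random.default_rng(3)
x = rng.exponential(scale=10, size=1_000)  # skewed data

K = 4
# quantiles 0.2, 0.4, 0.6, 0.8 - drop the 0% and 100% endpoints
knots = np.quantile(x, np.linspace(0, 1, K + 2)[1:-1])
```

On skewed data like this, quantile knots cluster where observations are dense, which is usually where the fit needs the most flexibility.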
Smoothing Splines¶
- We start with the following loss function: $$\text{RSS} = \sum_{i=1}^n\left(y_i - g\left(x_i\right)\right)^2$$
- If $g\left(x\right)$ is unconstrained then $\text{RSS}$ can be made zero, but we would overfit; such a function is unlikely to be smooth
- Therefore, we could minimize $\text{RSS}$ subject to the constraint that $g$ is smooth
- The loss function for smoothing splines is given by:
$$\sum_{i=1}^n\left(y_i - g\left(x_i\right)\right)^2 + \lambda \int g''\left(t\right)^2dt$$
- $\lambda$ is the smoothing penalty; as $\lambda$ increases, $g$ becomes smoother. It controls the bias-variance tradeoff
- $g''$ approximately measures the roughness of the function; $g''$ is zero for a straight line, i.e. as smooth as it gets
- $\int g''\left(t\right)^2dt$ measures the total change in $g'$ over its range
Smoothing Spline Solution¶
- $g\left(x\right)$ which minimizes the smooth penalty loss function has the following properties:
- It is a piecewise cubic polynomial with knots at unique values of $x_1,\dots,x_n$
- It is continuous at the first and second derivatives at the knots
- It is linear outside of the boundary knots
- Therefore, it is a natural cubic spline with knots at $x_1,\dots,x_n$
- It is not the same natural cubic spline that comes from the basis function approach, it is a shrunken version where $\lambda$ controls the level of shrinkage
Choosing $\lambda$¶
- $\lambda$ controls the smoothness and hence the degrees of freedom
- As $\lambda$ increases from zero to infinity, the effective degrees of freedom, $df_\lambda$, decrease from $n$ to 2
- Use cross validation to choose $\lambda$; leave-one-out cross validation can be computed efficiently for smoothing splines
- You could choose degrees of freedom rather than $\lambda$
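SciPy ships a penalized natural cubic smoothing spline that matches the loss above (`make_smoothing_spline`, which requires SciPy >= 1.10; the data and the two $\lambda$ values are illustrative assumptions):

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline

# lam is the lambda in the penalty term of the loss above
rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.shape)

rough = make_smoothing_spline(x, y, lam=1e-8)   # small lambda: wiggly, low bias
smooth = make_smoothing_spline(x, y, lam=1e-1)  # large lambda: close to linear
```

Evaluating both fits shows the small-$\lambda$ spline tracking the training points much more closely, illustrating the bias-variance tradeoff.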
Local Regression¶
- $s$ denotes the span, which is the proportion of points used to compute the local regression. It controls the flexibility of the fit
- Algorithm for $X=x_0$
- Take the $k = sn$ data points closest to $x_0$
- Assign a weight $K_{i,0} = K\left(x_i,x_0\right)$ to each point in the neighborhood, where the furthest point has weight 0 and the closest has the highest weight
- Fit a weighted least squares regression of $y_i$ on $x_i$ finding $\beta$ that minimizes: $$\begin{align}\sum_{i=1}^n K_{i,0}\left(y_i - \beta_0 - \beta_1x_i\right)^2\end{align}$$
- The fitted value at $x_0$ is given by $\hat{f}\left(x_0\right) = \hat{\beta}_0 + \hat{\beta}_1 x_0$
- Variants:
- Different weighting functions
- You could do a polynomial expansion on the feature
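The algorithm above can be sketched directly; the tricube weight function, span value and data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 1, 300))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, size=x.shape)

def local_fit(x0, x, y, s=0.3):
    """Local linear regression fitted value at x0 with span s."""
    k = int(s * len(x))                # span: use the k nearest points
    d = np.abs(x - x0)
    idx = np.argsort(d)[:k]
    u = d[idx] / d[idx].max()          # scale distances to [0, 1]
    w = (1 - u**3) ** 3                # tricube: furthest point gets weight 0
    X = np.column_stack([np.ones(k), x[idx]])
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y[idx])  # weighted least squares
    return beta[0] + beta[1] * x0      # fitted value at x0
```

Repeating this at every query point traces out the local regression curve; note the fit must be recomputed for each $x_0$, which is why all the training data is needed at prediction time.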

Generalized Additive Models¶
- Extend multiple linear regression by allowing non linear functions
- We replace the standard linear equation with smooth non linear functions in the following way:
$$\begin{align}y_i = \beta_0 + \sum_{j=1}^p f_j\left(x_{ij}\right) + \epsilon_i\end{align}$$
- It is called an additive model because we calculate a separate $f_j$ for each $X_j$ and then add together all their contributions
- We can easily use regression splines via this method, but not smoothing splines
- Use backfitting to fit GAMs with smoothing splines
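A minimal backfitting sketch, using a univariate polynomial smoother for each $f_j$ for simplicity (an assumption; the text uses smoothing splines) on synthetic two-feature data:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500
x1 = rng.uniform(-1, 1, n)
x2 = rng.uniform(-1, 1, n)
y = np.sin(np.pi * x1) + x2**2 + rng.normal(0, 0.1, n)

def smooth(x, r, deg=5):
    """Fit the partial residual r on a degree-`deg` polynomial in x."""
    return np.polyval(np.polyfit(x, r, deg), x)

f1 = np.zeros(n)
f2 = np.zeros(n)
beta0 = y.mean()
for _ in range(20):                  # backfitting loop
    f1 = smooth(x1, y - beta0 - f2)  # fit f1 to the partial residual
    f1 -= f1.mean()                  # center for identifiability
    f2 = smooth(x2, y - beta0 - f1)
    f2 -= f2.mean()
y_hat = beta0 + f1 + f2
```

Each pass refits one $f_j$ against the residual left over by the others, cycling until the fits stabilize; the centering step keeps the intercept identifiable.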
Summary of Methods¶
| Method | Advantages | Disadvantages |
|---|---|---|
| Polynomial regression | Simple way to provide non linearity | Notorious tail behaviour, bad for extrapolation - limit to degree less than 4 |
| Step function | Simple and popular in biostatistics and epidemiology | Miss relationships at the break points, choosing the breakpoints can be difficult |
| Piecewise polynomial | More flexible than a single global polynomial | Discontinuities at break points. High degrees of freedom |
| Spline | No need to go beyond degree 3 as the discontinuity is not visible to human eye. Introduce flexibility without adding too many degrees of freedom (leads to more stable estimates compared to piecewise polynomial) | High variance on outer range (improved by natural spline) |
| Smoothing Spline | It is a natural cubic spline. Do not need to specify knots. Can specify $\lambda$ through degrees of freedom. | More complicated method to calculate |
| Local regression | Allows fitting a model that is global in some variables and local in others | Needs all training data for prediction. Performs poorly for $p>3$ due to the curse of dimensionality |
| Generalized additive model | We can see the contribution of each $X_j$ onto $y$. More accurate predictions compared to linear. | Main limitation is that it has to be additive |